Integrate Automated QDQ autotuner - part 3.2#838
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughIntroduces a new ONNX quantization autotuning module that enables automatic Q/DQ (Quantize/Dequantize) node insertion and optimization using pattern-based region analysis. Provides a comprehensive framework for discovering optimal insertion points, profiling schemes, and exporting quantized models. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Autotuner as QDQAutotuner
participant RegionSearch as CombinedRegionSearch
participant Profiler as Profiling System
participant Inserter as Q/DQ Insertion
participant Exporter as ONNX Exporter
User->>Autotuner: initialize(config, pattern_cache)
Autotuner->>Autotuner: Load model & init state
User->>RegionSearch: discover regions
RegionSearch-->>Autotuner: return regions
loop For each region
User->>Autotuner: set_profile_region(region)
Autotuner->>Autotuner: Commit profiling outcomes
Autotuner->>Profiler: Prepare region-pattern pairs
loop Generate candidates
User->>Autotuner: generate()
Autotuner->>Inserter: Build insertion scheme
Inserter->>Inserter: Insert Q/DQ nodes
User->>Autotuner: submit(latency_ms)
Autotuner->>Autotuner: Track performance metrics
end
end
User->>Autotuner: export_onnx(best=True)
Autotuner->>Inserter: Apply best scheme
Inserter->>Exporter: Finalize Q/DQ graph
Exporter-->>User: return quantized ONNX bytes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@modelopt/onnx/quantization/autotune/autotuner.py`:
- Around line 1024-1029: The try/except around graph.cleanup().toposort()
swallows all exceptions (except Exception as e) and merely logs a warning, which
can hide serious graph corruption; update the handler in autotuner.py to either
catch only expected exception types (e.g., specific cleanup/toposort exceptions)
or log the error and re-raise it so execution stops on unexpected failures —
locate the graph.cleanup().toposort() call and replace the broad except with
either a narrowed except for known recoverable exceptions or add a raise after
logger.warning/failure log to propagate the error.
- Line 622: Remove the redundant local import "from datetime import datetime"
(the one added at line with the single import statement) in autotuner.py; the
module already imports datetime at the top of the file, so delete this local
import to avoid duplication and potential shadowing (look for the statement
"from datetime import datetime" inside the function or block and remove it).
- Around line 912-918: The zero-point arrays q_zp_values (and the corresponding
dq_zp_values) are created with a hardcoded dtype np.int8 which can mismatch the
QuantizeLinear/DequantizeLinear output type when quant_type is "uint8" or other
types; update their construction to use the same dtype as the computed
quant_dtype instead of np.int8 so q_zp_values and dq_zp_values match the
quantized output element type used when building q_inputs and dq_inputs (refer
to q_scale_values, q_zp_values, q_inputs and the corresponding dq_* variables to
locate where to change the dtype).
- Around line 1013-1021: The import of get_tensor_consumer_node_indices is wrong
and causes an import error; replace that import with get_tensor_consumer_nodes
and update any usage names accordingly (the code that uses tensor_users_map
already expects a defaultdict(list) so no KeyError handling is needed).
Specifically, change the symbol imported from
modelopt.onnx.quantization.graph_utils from get_tensor_consumer_node_indices to
get_tensor_consumer_nodes and ensure tensor_users_map is assigned from
get_tensor_consumer_nodes(...) where used in the autotuner (references:
get_tensor_consumer_node_indices, get_tensor_consumer_nodes, tensor_users_map).
🧹 Nitpick comments (4)
modelopt/onnx/quantization/autotune/autotuner.py (4)
229-229: Consider defining config attributes explicitly.Using
getattr(self.config, "maximum_generation_attempts", 100)with defaults (also seen at lines 718-719 and 744) suggests these attributes may not be formally defined on theConfigclass. This pattern makes it harder to discover available configuration options.💡 Suggestion
Consider adding these attributes to the
Configclass with documented defaults rather than relying ongetattrfallbacks:# In Config class maximum_generation_attempts: int = 100 top_percent_to_mutate: float = 0.1 minimum_schemes_to_mutate: int = 1 maximum_mutations: int = 3
333-335: Replace assertions with explicit checks for runtime validation.Assertions on lines 333-335 (and similarly at line 314) are used for validating runtime conditions. Since assertions can be disabled with
python -O, these should be explicit checks for production code.🛡️ Proposed fix
- full_insertion_scheme = pattern.get_full_insertion_scheme(region, self.graph) - assert full_insertion_scheme is not None - all_region_ips = pattern.matches(region, self.graph, full_insertion_scheme) - assert isinstance(all_region_ips, set) + full_insertion_scheme = pattern.get_full_insertion_scheme(region, self.graph) + if full_insertion_scheme is None: + logger.warning(f"Failed to get full insertion scheme for region {region.id}") + continue + all_region_ips = pattern.matches(region, self.graph, full_insertion_scheme) + if not isinstance(all_region_ips, set): + raise TypeError(f"Expected set from pattern.matches, got {type(all_region_ips)}")
972-985: Assertions used for critical runtime validation.These assertions validate critical invariants (node index bounds, input index bounds, tensor name matching) but can be disabled with
python -O. Consider using explicit checks withValueError/IndexErrorfor production safety.🛡️ Proposed fix
if node_index is not None: - assert node_index < len(graph.nodes), "Node index out of range" + if node_index >= len(graph.nodes): + raise IndexError(f"Node index {node_index} out of range (max: {len(graph.nodes) - 1})") target_node = graph.nodes[node_index] - assert input_index is not None, "Input index must be set when node index is set" - assert input_index < len(target_node.inputs), ( - f"Input index out of range for node {target_node.name}" - ) + if input_index is None: + raise ValueError("Input index must be set when node index is set") + if input_index >= len(target_node.inputs): + raise IndexError(f"Input index {input_index} out of range for node {target_node.name}") original_tensor = target_node.inputs[input_index] - assert tensor_name == original_tensor.name, ( - f"Tensor name mismatch for node {target_node.name} input {input_index}" - ) + if tensor_name != original_tensor.name: + raise ValueError(f"Tensor name mismatch: expected '{tensor_name}', got '{original_tensor.name}'") else: - assert tensor_name in tensor_map, f"Tensor {tensor_name} not found in tensor map" - assert input_index is None, "Input index must be None when node index is None" + if tensor_name not in tensor_map: + raise KeyError(f"Tensor {tensor_name} not found in tensor map") + if input_index is not None: + raise ValueError("Input index must be None when node index is None")
1042-1049: Consider iterative approach for deep region hierarchies.
_visit_region_recursivelyuses recursion which could hit Python's stack limit for very deep region hierarchies. While this is unlikely for typical ONNX models, an iterative approach would be more robust.♻️ Iterative alternative
def _visit_region_recursively(self, region: Region) -> list[Region]: """Iteratively traverse region hierarchy and collect all regions.""" regions = [] stack = [region] while stack: current = stack.pop() regions.append(current) stack.extend(current.get_children()) return regions
1ffcf7f to
bd18dfa
Compare
tests/unit/onnx/quantization/autotune/autotune/test_autotuner.py
Outdated
Show resolved
Hide resolved
|
/ok to test b02fef1 |
c5340bb to
68ffa5a
Compare
|
/ok to test 68ffa5a |
68ffa5a to
ebba6e6
Compare
|
/ok to test 6124ac9 |
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
4ad0b32 to
9a94dc4
Compare
Review: PR #838 - Integrate Automated QDQ autotuner - part 3.2Overall AssessmentThis PR implements the core QDQAutotuner class. The code is well-structured with good documentation and tests, but several critical and high-priority issues need fixing. 🚨 Critical Issues1. Import Error:File: (line 44)
|
@willg-nv please address the critical and high priority issues raised above before merging. Thanks. |
Review: PR #838 - Integrate Automated QDQ autotuner - part 3.2Overall AssessmentThis PR implements the core QDQAutotuner class. The code is well-structured with good documentation and tests, but several critical and high-priority issues need fixing. 🚨 Critical Issues1. Import Error:
|
@willg-nv could you fix 1,2,4 and split the autotuner.py as suggested? |
Signed-off-by: Will Guo <willg@nvidia.com>
1, I think this is a false alarm for get_tensor_consumer_node_indices is in current repo. So there should be no ImportError 2: this issue does not exists in latest code. |
please check comment #838 (comment) |
|
/ok to test c99edab |
Signed-off-by: Will Guo <willg@nvidia.com>
Head branch was pushed to by a user without write access
Security Review: PRs #838 and #839SummaryBoth PRs have been reviewed against the security checklist. One security issue was identified in PR #839 with a PR #838 - Integrate Automated QDQ autotuner - part 3.2Security Check Results
Code ReviewThe PR uses Status: No security issues identified. PR #839 - Integrate Automated QDQ placement tool - part 3.3🔴 CRITICAL Security Issue FoundIssue: Unauthorized use of
|
| Check | Status | Details |
|---|---|---|
| torch.load(..., weights_only=False) | ✅ PASS | Not used in this PR |
| numpy.load(..., allow_pickle=True) | ✅ PASS | Not used in this PR |
| trust_remote_code=True | ✅ PASS | Not used in this PR |
| eval() / exec() on external input | ✅ PASS | Not found |
| # nosec bypass | 🔴 FAIL | Found unauthorized # nosec B108 comment |
Explanation
The # nosec B108 comment is used to bypass Bandit's B108 security check (probable insecure temporary file creation). According to the security policy:
"If a security-sensitive pattern is genuinely necessary, the PR must be reviewed and approved by @NVIDIA/modelopt-setup-codeowners with an explicit justification in the PR description."
This PR does not include:
- Approval from @NVIDIA/modelopt-setup-codeowners
- An explicit justification in the PR description for why this bypass is safe
Recommended Fix
# Option 1: Remove the nosec comment and use a safer path
DEFAULT_TIMING_CACHE = os.path.join(tempfile.gettempdir(), "trtexec_timing.cache")
# Option 2: If B108 bypass is genuinely required, add justification:
# nosec B108: This path is used for TensorRT timing cache which is
# created by trtexec and does not contain sensitive data.
# Reviewed and approved by: @NVIDIA/modelopt-setup-codeowners
DEFAULT_TIMING_CACHE = "/tmp/trtexec_timing.cache"Additional Security Notes
The PR uses yaml.safe_load() in PatternCache.load() and state loading, which is the secure method. No other security issues were identified.
Overall Recommendation
| PR | Status | Required Action |
|---|---|---|
| #838 | ✅ Approved | No security issues; can merge |
| #839 | 🔴 Request Changes | Remove or justify # nosec B108 comment with approval from @NVIDIA/modelopt-setup-codeowners |
modelopt/onnx/quantization/autotune/main.py is not in this PR |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #838 +/- ##
==========================================
- Coverage 73.09% 72.19% -0.90%
==========================================
Files 205 210 +5
Lines 22301 23488 +1187
==========================================
+ Hits 16300 16957 +657
- Misses 6001 6531 +530 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Signed-off-by: Will Guo <willg@nvidia.com>
|
/ok to test c1f93b2 |
|
/ok to test c1f93b2 |
## What does this PR do?
This PR implements QDQ autotuner CLI. This is the initial version of
CLI, it will be integrated to modelopt.onnx.quantization.autotune.
Usage:
```
python -m modelopt.onnx.quantization.autotune
--onnx_path model.onnx --schemes_per_region 50
--pattern_cache cache.yaml --qdq_baseline baseline.onnx
--quant_type int8 --verbose
```
PR 3.1: #837
PR 3.2 #838
PR 3.3: #839
**Overview:** ?
## Before your PR is "*Ready for review*"
<!-- If you haven't finished some of the above items you can still open
`Draft` PR. -->
- **Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)**
and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: No
- **Did you add or update any necessary documentation?**: Document will
be added in part 4.
- **Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**:
CHANGE log will be added in part 4.
## Additional Information
<!-- E.g. related issue. -->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
## Release Notes
* **New Features**
* Added a command-line interface for ONNX quantization autotuning with
configurable parameters for models, output paths, quantization
strategies, and TensorRT benchmarking.
* Introduced an automated workflow for pattern-based region optimization
with state management, baseline comparison, and benchmarking
capabilities.
<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Will Guo <willg@nvidia.com>
What does this PR do?
This PR implements QDQAutotuner class. This class is used to drive the main Autotuner workflow.
The workflow is:
This PR is part 2/4 of #703.
PR 3.1: #837
PR 3.2 #838
PR 3.3: #839
Overview: ?
Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.